Let $X$ be a random variable which for simplicity we shall assume to have discrete
values $x$ and which has a probability distribution depending in a known way on an
unknown real parameter $\Lambda$,

(1)  $p(x \mid \lambda) = \Pr[X = x \mid \Lambda = \lambda]$,

$\Lambda$ itself being a random variable with a priori distribution function

(2)  $G(\lambda) = \Pr[\Lambda \le \lambda]$.

The unconditional probability distribution of $X$ is then given by

(3)  $p_G(x) = \Pr[X = x] = \int p(x \mid \lambda)\, dG(\lambda)$,


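and the expected squared deviation of any estimator of $\Lambda$ of the form $\varphi(X)$ is

(4)  $E[\varphi(X) - \Lambda]^2 = E\{E[(\varphi(X) - \Lambda)^2 \mid \Lambda = \lambda]\}$

     $= \int \sum_x p(x \mid \lambda)\,[\varphi(x) - \lambda]^2\, dG(\lambda)$

     $= \sum_x \int p(x \mid \lambda)\,[\varphi(x) - \lambda]^2\, dG(\lambda)$,

which is a minimum when $\varphi(x)$ is defined for each $x$ as that value $y = y(x)$ for which

(5)  $I(x) = \int p(x \mid \lambda)\,(y - \lambda)^2\, dG(\lambda) = \text{minimum}$.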


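But for any fixed $x$ the quantity

(6)  $I(x) = y^2 \int p\, dG - 2y \int p\lambda\, dG + \int p\lambda^2\, dG$

is a minimum with respect to $y$ when

(7)  $y = \dfrac{\int p\lambda\, dG}{\int p\, dG}$,

the minimum value of $I(x)$ being

(8)  $I_G(x) = \int p(x \mid \lambda)\,\lambda^2\, dG(\lambda) - \dfrac{\left[\int p(x \mid \lambda)\,\lambda\, dG(\lambda)\right]^2}{\int p(x \mid \lambda)\, dG(\lambda)}$.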
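The step from (6) to (7) can be made explicit by completing the square in $y$. The shorthand $a_k = \int p(x \mid \lambda)\,\lambda^k\, dG(\lambda)$ for $k = 0, 1, 2$ is introduced here only for this sketch, and $a_0 > 0$ is assumed:

```latex
\begin{aligned}
I(x) &= a_0 y^2 - 2 a_1 y + a_2 \\
     &= a_0 \left( y - \frac{a_1}{a_0} \right)^2 + \left( a_2 - \frac{a_1^2}{a_0} \right).
\end{aligned}
```

The first term vanishes exactly when $y = a_1 / a_0$, which is (7), and the remaining bracket $a_2 - a_1^2 / a_0$ is the minimum value of $I(x)$.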
Hence the Bayes estimator of $\Lambda$ corresponding to the a priori distribution function $G$ of $\Lambda$
[in the sense of minimizing the expression (4)] is the random variable $\varphi_G(X)$ defined by
the function

(9)  $\varphi_G(x) = \dfrac{\int p(x \mid \lambda)\,\lambda\, dG(\lambda)}{\int p(x \mid \lambda)\, dG(\lambda)}$,

the corresponding minimum value of (4) being

(10)  $E[\varphi_G(X) - \Lambda]^2 = \sum_x I_G(x)$.

The expression (9) is, of course, the expected value of the a posteriori distribution of $\Lambda$
given $X = x$.
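As a concrete illustration of (9), the following minimal sketch evaluates the posterior mean when $G$ places finite mass on a few points, so that the integrals reduce to sums. The function names, the two-point prior, and the use of Poisson probabilities as the kernel $p(x \mid \lambda)$ are assumptions chosen for illustration only.

```python
import math


def bayes_estimate(x, kernel, support, weights):
    """Posterior mean (9) for a prior G with mass weights[i] at support[i].

    kernel(x, lam) must return p(x | lam).
    """
    num = sum(kernel(x, lam) * lam * w for lam, w in zip(support, weights))
    den = sum(kernel(x, lam) * w for lam, w in zip(support, weights))
    return num / den


def poisson_kernel(x, lam):
    """Poisson probability of x given mean lam (illustrative choice of p)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


# Hypothetical prior: Lambda equals 1 or 5 with equal probability.
support, weights = [1.0, 5.0], [0.5, 0.5]
for x in (0, 2, 8):
    print(x, bayes_estimate(x, poisson_kernel, support, weights))
```

Small observed values pull the estimate toward 1 and large ones toward 5, which is the behaviour one would expect of a posterior mean under this prior.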

If the a priori distribution function $G$ is known to the experimenter then $\varphi_G$ defined
by (9) is a computable function, but if $G$ is unknown, as is usually the case, then $\varphi_G$ is
not computable. This trouble is not eliminated by the adoption of arbitrary rules
prescribing forms for $G$ (as is done, for example, by H. Jeffreys [1] in his theory of statistical
inference). It is partly for this reason, namely that even when $G$ may be assumed to exist it is
generally unknown to the experimenter, that various other criteria for estimators
(unbiasedness, minimax, etc.) have been proposed which have the advantage of not
requiring a knowledge of $G$.

Suppose now that the problem of estimating $\Lambda$ from an observed value of $X$ is going
to occur repeatedly with a fixed and known $p(x \mid \lambda)$ and a fixed but unknown $G(\lambda)$, and let

(11)  $(\Lambda_1, X_1), (\Lambda_2, X_2), \ldots, (\Lambda_n, X_n), \ldots$

denote the sequence so generated. [The $\Lambda_n$ are independent random variables with
common distribution function $G$, and the distribution of $X_n$ depends only on $\Lambda_n$ and for
$\Lambda_n = \lambda$ is given by $p(x \mid \lambda)$.] If we want to estimate an unknown $\Lambda_n$ from an observed $X_n$ and
if the previous values $\Lambda_1, \ldots, \Lambda_{n-1}$ are by now known, then we can form the empirical
distribution function of the random variable $\Lambda$,

(12)  $G_{n-1}(\lambda) = \dfrac{\text{number of terms } \Lambda_1, \ldots, \Lambda_{n-1} \text{ which are } \le \lambda}{n - 1}$,

and take as our estimate of $\Lambda_n$ the quantity $\psi_n(X_n)$, where by definition

(13)  $\psi_n(x) = \dfrac{\int p(x \mid \lambda)\,\lambda\, dG_{n-1}(\lambda)}{\int p(x \mid \lambda)\, dG_{n-1}(\lambda)}$,

which is obtained from (9) by replacing the unknown a priori $G(\lambda)$ by the empirical
$G_{n-1}(\lambda)$. Since $G_{n-1}(\lambda) \to G(\lambda)$ with probability 1 as $n \to \infty$, the ratio (13) will, under
suitable regularity conditions on the kernel $p(x \mid \lambda)$, tend for any fixed $x$ to the Bayes
function $\varphi_G(x)$ defined by (9) and hence, again under suitable conditions, the expected
squared deviation of $\psi_n(X_n)$ from $\Lambda_n$ will tend to the Bayes value (10).
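Because the empirical distribution function puts mass $1/(n-1)$ on each past parameter value, the integrals in the plug-in ratio become plain sums and the common factor cancels. The sketch below shows this under stated assumptions: the names `plug_in_estimate` and `poisson_kernel`, the Poisson form of the kernel, and the list of past parameter values are all hypothetical.

```python
import math


def poisson_kernel(x, lam):
    """Poisson probability of x given mean lam (illustrative kernel)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def plug_in_estimate(x, past_lambdas, kernel=poisson_kernel):
    """Ratio (9) with G replaced by the empirical d.f. of past_lambdas.

    Each past value carries mass 1 / len(past_lambdas), which cancels
    between numerator and denominator.
    """
    num = sum(kernel(x, lam) * lam for lam in past_lambdas)
    den = sum(kernel(x, lam) for lam in past_lambdas)
    return num / den


# Hypothetical previously revealed parameter values.
past = [0.8, 1.1, 4.7, 5.2, 0.9]
print(plug_in_estimate(0, past), plug_in_estimate(7, past))
```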

In practice, of course, it will be unusual for the previous values $\Lambda_1, \ldots, \Lambda_{n-1}$ to be
known, and hence the function (13) will be no more computable than the true Bayes
function (9). However, in many cases the previous values $X_1, \ldots, X_{n-1}$ will be available to
the experimenter at the moment when $\Lambda_n$ is to be estimated, and the question then arises
whether it is possible to infer from the set of values $X_1, \ldots, X_n$ the approximate form
of the unknown $G$, or at least, in the present case of quadratic estimation, to approximate
the value of the functional of $G$ defined by (9). To this end we observe that for any fixed $x$
the empirical frequency

(14)  $p_n(x) = \dfrac{\text{number of terms } X_1, \ldots, X_n \text{ which equal } x}{n}$

tends with probability 1 as $n \to \infty$ to the function $p_G(x)$ defined by (3), no matter what
the a priori distribution function $G$. Hence there arises the following mathematical
problem: from an approximate value (14) of the integral (3), where $p(x \mid \lambda)$ is a known kernel,
to obtain an approximation to the unknown distribution function $G$, or at least, in the
present case, to the value of the Bayes function (9) which depends on $G$. (This problem
was posed in [4].) The possibility of doing this will depend on the nature of the kernel
$p(x \mid \lambda)$ and on the class, say $\mathcal{G}$, to which the unknown $G$ is assumed to belong. In order
to fix the ideas we shall consider several special cases, the first being that of the Poisson
kernel

(15)  $p(x \mid \lambda) = e^{-\lambda}\lambda^x / x!$;  $x = 0, 1, \ldots$;  $\lambda > 0$;

$\mathcal{G}$ being the class of all distribution functions on the positive real axis.

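In this case we have

(16)  $p_G(x) = \int p(x \mid \lambda)\, dG(\lambda) = \int_0^\infty e^{-\lambda}\lambda^x\, dG(\lambda) / x!$

and

(17)  $\varphi_G(x) = \dfrac{\int_0^\infty e^{-\lambda}\lambda^{x+1}\, dG(\lambda)}{\int_0^\infty e^{-\lambda}\lambda^x\, dG(\lambda)}$,

and we can write the fundamental relation

(18)  $\varphi_G(x) = (x + 1)\,\dfrac{p_G(x + 1)}{p_G(x)}$.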

If we now define the function

(19)  $\varphi_n(x) = (x + 1)\,\dfrac{p_n(x + 1)}{p_n(x)} = (x + 1)\,\dfrac{\text{number of terms } X_1, \ldots, X_n \text{ which are equal to } x + 1}{\text{number of terms } X_1, \ldots, X_n \text{ which are equal to } x}$,

then no matter what the unknown $G$ we shall have for any fixed $x$

(20)  $\varphi_n(x) \to \varphi_G(x)$ with probability 1 as $n \to \infty$.

This suggests using as an estimate of the unknown $\Lambda_n$ in the sequence (11) the
computable quantity

(21)  $\varphi_n(X_n)$,

in the hope that as $n \to \infty$,

(22)  $E[\varphi_n(X_n) - \Lambda_n]^2 \to E[\varphi_G(X) - \Lambda]^2$.

We shall not investigate here the question of whether (22) does actually hold for the
particular function (19) or whether (19) represents the best possible choice for minimizing
in some sense the expected squared deviation. (See [8].)
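The count-ratio estimate for the Poisson kernel needs only the observed values themselves. The following is a minimal sketch; the function name `robbins_estimate`, the sample data, and the decision to return `None` when the denominator count is zero are illustrative assumptions, since the text does not say how that case should be handled.

```python
from collections import Counter


def robbins_estimate(x, observations):
    """(x + 1) * #{X_i = x + 1} / #{X_i = x} for the Poisson kernel.

    Returns None when x has not been observed, because the ratio is
    then undefined.
    """
    counts = Counter(observations)
    if counts[x] == 0:
        return None
    return (x + 1) * counts[x + 1] / counts[x]


# Hypothetical record of observed counts X_1, ..., X_n (n = 12).
xs = [0, 1, 1, 2, 0, 3, 1, 2, 2, 1, 0, 4]
print(robbins_estimate(1, xs))  # (1 + 1) * 3 / 4 = 1.5
```

In this sample the largest observed value, 4, receives the estimate 0 because no 5 has been seen yet. That is one reason a smoothed version might be preferred in practice.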
It is of interest to compute the value of (10) for various a priori distribution functions
$G$ in order to compare its value with the expected squared deviation of the usual
(maximum likelihood, minimum variance unbiased) Poisson estimator, $X$ itself, for which

(23)  $E(X - \Lambda)^2 = E\Lambda = \int_0^\infty \lambda\, dG(\lambda)$.

Suppose, for example, that $G$ is a gamma type distribution function with density

(24)  $G'(\lambda) = C\lambda^{b-1}e^{-h\lambda}$;  $\lambda, b, h > 0$;  $C = h^b / \Gamma(b)$.

By elementary computation we find that

(25)  $E\Lambda = \dfrac{b}{h}$,  $\operatorname{Var}\Lambda = \dfrac{b}{h^2}$

and

(26)  $\varphi_G(x) = \dfrac{x + b}{1 + h}$,  $E[\varphi_G(X) - \Lambda]^2 = \dfrac{b}{h(1 + h)}$;

hence

(27)  $\dfrac{E[\varphi_G(X) - \Lambda]^2}{E(X - \Lambda)^2} = \dfrac{1}{1 + h}$.

For example, if $b = 100$, $h = 10$ then

(28)  $E\Lambda = 10$,  $\operatorname{Var}\Lambda = 1$,  $\varphi_G(x) = \dfrac{x + 100}{11}$,  $\dfrac{E[\varphi_G(X) - \Lambda]^2}{E(X - \Lambda)^2} = \dfrac{1}{11}$.
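The "elementary computation" for the gamma prior can be checked in exact arithmetic. The sketch below relies on the standard conjugacy fact that, given $X = x$, the posterior of $\Lambda$ is again of gamma type with shape $b + x$ and rate $h + 1$. The function names are hypothetical and the code is a numerical check, not a derivation.

```python
from fractions import Fraction


def gamma_poisson_summary(b, h):
    """Return (E Lambda, Var Lambda, Bayes risk, risk ratio) exactly."""
    b, h = Fraction(b), Fraction(h)
    prior_mean = b / h
    prior_var = b / h ** 2
    # Posterior variance given X = x is (b + x) / (1 + h)^2; averaging
    # over X uses E(X) = E(Lambda) = b / h.
    bayes_risk = (b + prior_mean) / (1 + h) ** 2
    usual_risk = prior_mean  # E(X - Lambda)^2 = E(Lambda)
    return prior_mean, prior_var, bayes_risk, bayes_risk / usual_risk


def bayes_function(x, b, h):
    """Posterior mean (x + b) / (1 + h) for integer x, b, h."""
    return Fraction(x + b, 1 + h)


print(gamma_poisson_summary(100, 10))
```

For $b = 100$, $h = 10$ this gives mean 10, variance 1, Bayes risk 10/11 and ratio 1/11, in agreement with the values stated in the text.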


An even simpler case occurs when $\Lambda$ has all its probability concentrated at a single
value $\lambda$. In this case, of course, the Bayes function is

(29)  $\varphi_G(x) = \lambda$,

not involving $x$ at all, and

(30)  $E[\varphi_G(X) - \Lambda]^2 = 0$,

while as before

(31)  $E(X - \Lambda)^2 = E\Lambda = \lambda$.

Here the sequence (11) consists of observations $X_1, \ldots, X_n, \ldots$ from the same Poisson
population (although this fact may not be apparent to the experimenter at the
beginning); the traditional estimator $\varphi(x) = x$ does not take advantage of this favorable
circumstance and continues to have the expected squared deviation $\lambda$ after any number $n$
of trials.

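As a second example we take the geometric kernel

(32)  $p(x \mid \lambda) = (1 - \lambda)\lambda^x$;  $x = 0, 1, \ldots$;  $0 < \lambda < 1$;

for which

(33)  $p_G(x) = \int_0^1 (1 - \lambda)\lambda^x\, dG(\lambda)$,  $\varphi_G(x) = \dfrac{\int_0^1 (1 - \lambda)\lambda^{x+1}\, dG(\lambda)}{\int_0^1 (1 - \lambda)\lambda^x\, dG(\lambda)} = \dfrac{p_G(x + 1)}{p_G(x)}$.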
Here it is natural to estimate $\Lambda_n$ by (21) with the definition

(34)  $\varphi_n(x) = \dfrac{\text{number of terms } X_1, \ldots, X_n \text{ which are equal to } x + 1}{\text{number of terms } X_1, \ldots, X_n \text{ which are equal to } x}$.
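A small simulation can illustrate how the count ratio for the geometric kernel approaches the Bayes function. Everything specific here is an assumption made for the sketch: the two-point prior on 0.2 and 0.6, the sample size, the seed, and the helper names.

```python
import math
import random
from collections import Counter


def sample_geometric(lam, rng):
    """Draw x with probability (1 - lam) * lam**x, x = 0, 1, ..."""
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return int(math.log(u) / math.log(lam))


def geometric_eb_estimate(x, observations):
    """#{X_i = x + 1} / #{X_i = x}; None if x was never observed."""
    counts = Counter(observations)
    if counts[x] == 0:
        return None
    return counts[x + 1] / counts[x]


def p_G(x):
    """Exact marginal for the hypothetical prior: 0.2 or 0.6, equally likely."""
    return 0.5 * (0.8 * 0.2 ** x + 0.4 * 0.6 ** x)


rng = random.Random(1)
lambdas = [rng.choice([0.2, 0.6]) for _ in range(20000)]
xs = [sample_geometric(lam, rng) for lam in lambdas]

# Empirical ratio at x = 0 next to the exact value p_G(1) / p_G(0) = 1/3.
print(geometric_eb_estimate(0, xs), p_G(1) / p_G(0))
```

Note that the raw ratio is not constrained to lie in the interval (0, 1), so for sparsely observed values of $x$ it can fall outside the parameter range.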


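Our third example will be the binomial kernel

(35)  $p_r(x \mid \lambda) = \binom{r}{x}\lambda^x(1 - \lambda)^{r-x}$;  $x = 0, 1, \ldots, r$;  $0 \le \lambda \le 1$.

Here $r$ is a fixed positive integer representing the number of trials, $X$ the number of
successes, and $\Lambda$ the unknown probability of success in each trial. $\mathcal{G}$ may be taken as the class
of all distribution functions on the interval (0, 1). In this case

(36)  $p_{G,r}(x) = \int p_r(x \mid \lambda)\, dG(\lambda) = \binom{r}{x}\int_0^1 \lambda^x(1 - \lambda)^{r-x}\, dG(\lambda)$,

      $\varphi_{G,r}(x) = \dfrac{\int_0^1 \lambda^{x+1}(1 - \lambda)^{r-x}\, dG(\lambda)}{\int_0^1 \lambda^x(1 - \lambda)^{r-x}\, dG(\lambda)}$,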

so that we can write the fundamental relation

(37)  $\varphi_{G,r}(x) = \dfrac{x + 1}{r + 1}\cdot\dfrac{p_{G,r+1}(x + 1)}{p_{G,r}(x)}$.

Let

(38)  $p_{n,r}(x) = \dfrac{\text{number of terms } X_1, \ldots, X_n \text{ which are equal to } x}{n}$;  $x = 0, 1, \ldots, r$;


then $p_{n,r}(x) \to p_{G,r}(x)$ with probability 1 as $n \to \infty$. Now consider the sequence of
random variables

(39)  $X'_1, X'_2, \ldots$

where $X'_n$ denotes the number of successes in, say, the first $r - 1$ out of the $r$ trials which
produced $X_n$ successes, and let

(40)  $p_{n,r-1}(x) = \dfrac{\text{number of terms } X'_1, \ldots, X'_n \text{ which are equal to } x}{n}$;

then $p_{n,r-1}(x) \to p_{G,r-1}(x)$ with probability 1 as $n \to \infty$. Thus if we set


(41)  $\varphi_{n,r}(x) = \dfrac{x + 1}{r}\cdot\dfrac{p_{n,r}(x + 1)}{p_{n,r-1}(x)}$,

then

(42)  $\varphi_{n,r}(x) \to \dfrac{x + 1}{r}\cdot\dfrac{p_{G,r}(x + 1)}{p_{G,r-1}(x)} = \varphi_{G,r-1}(x)$

with probability 1 as $n \to \infty$. If we take as our estimate of $\Lambda_n$ the value

(43)  $\varphi_{n,r}(X'_n)$

then for large $n$ we will do about as well as if we knew the a priori $G$ but confined ourselves
to the first $r - 1$ out of each set of $r$ trials. For large $r$ this does not sacrifice much
information, but it is by no means clear that (43) is the "best" estimator of $\Lambda_n$ that could
be devised in the spirit of our discussion.
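The binomial construction uses two tallies from the same data: successes in all $r$ trials and successes in the first $r - 1$ trials. A minimal sketch follows, assuming each experiment is recorded as a list of 0/1 outcomes of common length $r$. The function name, the data layout, the sample data, and the `None` return for an unobserved denominator are all hypothetical choices.

```python
from collections import Counter


def binomial_eb_estimate(x, trials):
    """((x + 1) / r) * #{X_i = x + 1} / #{X'_i = x}.

    trials is a list of 0/1 outcome lists, each of length r. X_i counts
    successes in all r trials and X'_i in the first r - 1 trials. The
    common factor 1/n of the two frequencies cancels.
    """
    r = len(trials[0])
    full = Counter(sum(t) for t in trials)          # X_i
    reduced = Counter(sum(t[:-1]) for t in trials)  # X'_i
    if reduced[x] == 0:
        return None
    return (x + 1) / r * full[x + 1] / reduced[x]


# Hypothetical outcomes of six experiments with r = 3 trials each.
data = [[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
print(binomial_eb_estimate(1, data))  # (2 / 3) * 2 / 4
```

The argument $x$ ranges over $0, \ldots, r - 1$, since the estimate is evaluated at the reduced count for the current experiment.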
As a final example consider any kernel of the "Laplacian" type

(44)  $p(x \mid \lambda) = e^{\lambda x} f(x)\, h(\lambda)$.

We have

(45)  $p_G(x) = f(x)\int e^{\lambda x} h(\lambda)\, dG(\lambda)$,

      $f(x)\,\dfrac{d}{dx}\!\left[\dfrac{p_G(x)}{f(x)}\right] = f(x)\int \lambda e^{\lambda x} h(\lambda)\, dG(\lambda)$,

provided the differentiation under the integral sign is justified. Hence

(46)  $\varphi_G(x) = \dfrac{d}{dx}\log\dfrac{p_G(x)}{f(x)} = \dfrac{p'_G(x)}{p_G(x)} - \dfrac{f'(x)}{f(x)}$.

Perhaps a satisfactory approximation to $\varphi_G(x)$ might be obtained by replacing $p_G(x)$
in (46) by a smoothed interpolation based on $p_n(x)$. The kernel (44) has been
considered by M. C. K. Tweedie [6] and I am indebted to him for this example.
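For a kernel of the form (44), dividing $\int \lambda e^{\lambda x} h\, dG$ by $\int e^{\lambda x} h\, dG$ shows that the posterior mean (9) equals the derivative of $\log[p_G(x)/f(x)]$. The sketch below checks this numerically under illustrative assumptions: the particular $f$ and $h$ (which make the kernel a unit-variance normal density, so $x$ is treated as continuous), the two-point prior, and the finite-difference step are not taken from the text.

```python
import math


def f(x):
    """Illustrative f(x): the standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def h(lam):
    """Illustrative h(lambda); e^{lam x} f(x) h(lam) is the N(lam, 1) density."""
    return math.exp(-0.5 * lam * lam)


def p_G(x, support, weights):
    """Marginal density for a prior with mass weights[i] at support[i]."""
    return sum(w * math.exp(lam * x) * f(x) * h(lam)
               for lam, w in zip(support, weights))


def posterior_mean(x, support, weights):
    """Direct evaluation of the ratio (9); f(x) cancels."""
    num = sum(w * lam * math.exp(lam * x) * h(lam)
              for lam, w in zip(support, weights))
    den = sum(w * math.exp(lam * x) * h(lam)
              for lam, w in zip(support, weights))
    return num / den


def dlog_ratio(x, support, weights, eps=1e-5):
    """Central-difference derivative of log(p_G(x) / f(x))."""
    def g(t):
        return math.log(p_G(t, support, weights) / f(t))
    return (g(x + eps) - g(x - eps)) / (2.0 * eps)


support, weights = [-1.0, 2.0], [0.4, 0.6]  # hypothetical prior
for x in (-1.0, 0.5, 2.0):
    print(x, posterior_mean(x, support, weights), dlog_ratio(x, support, weights))
```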

Until now we have been concerned only with approximating to the Bayes function
$\varphi_G(x)$ defined by (9). In many cases we shall want an approximation to some other
functional of the unknown a priori distribution function $G$; in particular to $G$ itself. We shall
make a few remarks about this problem in the general case in which $X$ is not restricted
to discrete values but has a distribution function

(47)  $F(x \mid \lambda) = \Pr[X \le x \mid \Lambda = \lambda]$

depending on the random variable $\Lambda$ whose distribution function $G$ is unknown. The
unconditional distribution function of $X$ is then

(48)  $F_G(x) = \Pr[X \le x] = \int F(x \mid \lambda)\, dG(\lambda)$,

and there is assumed to be available an infinite sequence $X_1, X_2, \ldots$ of independent
random variables with the common distribution function $F_G$. The empirical distribution
function

(49)  $F_n(x) = \dfrac{\text{number of terms } X_1, \ldots, X_n \text{ which are } \le x}{n}$

is known to converge uniformly to $F_G(x)$ with probability 1 as $n \to \infty$.

Problem: to find in terms of $F_n(x)$ a distribution function $G_n(\lambda)$ which will converge
as $n \to \infty$ to the unknown $G(\lambda)$.

Let $\mathcal{G}$ denote some class of distribution functions to which the unknown $G$ is
assumed to belong. ($\mathcal{G}$ might, for example, be the class of all distribution functions, or all
those with total mass distributed on some fixed finite interval.) The correspondence

(50)  $F_G(x) = \int F(x \mid \lambda)\, dG(\lambda)$

maps $\mathcal{G}$ onto some class of distribution functions which we shall denote by $\mathcal{H}$. We shall
assume that the kernel $F(x \mid \lambda)$ is such that this mapping is one-to-one. Now, since we
know an approximation $F_n$ to $F_G$, it would be natural to seek an approximation to $G$ by
solving the functional equation (50) for $G$ with $F_G$ replaced by $F_n$. Unfortunately, in
general this will be impossible since $F_n$ will not belong to the class $\mathcal{H}$. [For example,
if $F(x \mid \lambda)$ is continuous in $x$ then all elements of $\mathcal{H}$ will be continuous, whereas $F_n$ is a
step function.] However, we may proceed as follows. Let $F^*_n$ be any element of $\mathcal{H}$ whose
distance (in the sense of maximum absolute value of the difference for all $x$) from $F_n$ is
within $\epsilon_n \to 0$ of the minimum distance of $F_n$ from $\mathcal{H}$ (this is the "minimum distance"
method of Wolfowitz), and let $G_n$ be defined by the relation

(51)  $F^*_n(x) = \int F(x \mid \lambda)\, dG_n(\lambda)$.

Then $F^*_n \to F_G$ in the maximum difference metric, and under suitable conditions on the
kernel $F(x \mid \lambda)$ it will follow that $G_n \to G$. We shall go into this question in more detail
elsewhere, but at least it indicates one possible way of obtaining an empirical
approximation to a "mixing" distribution $G$ from observations on the "mixed" distribution $F_G$.
(See [5] also.) This problem, special cases of which have occurred several times in the
statistical literature (see, for example, [2], [4], [7] and pp. 84-102 in [3]), awaits a
satisfactory solution and seems to be of considerable importance.
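To make the minimum distance idea concrete, the sketch below works in a deliberately narrow setting that is an assumption of this example, not of the text: the kernel is a unit-variance normal distribution function, the mixing distribution is known to sit on the two points 0 and 3, and only its weight `w` at 0 is unknown. The search is an exact minimization over a coarse grid rather than a choice within a vanishing tolerance of the minimum, and all names are hypothetical.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mixed_cdf(x, w, lam0=0.0, lam1=3.0):
    """Mixed d.f. when G has mass w at lam0 and 1 - w at lam1."""
    return w * normal_cdf(x - lam0) + (1.0 - w) * normal_cdf(x - lam1)


def sup_distance(sorted_xs, cdf):
    """Maximum absolute difference between the empirical d.f. and cdf."""
    n = len(sorted_xs)
    d = 0.0
    for i, x in enumerate(sorted_xs):
        c = cdf(x)
        # The step function jumps from i/n to (i + 1)/n at x.
        d = max(d, abs((i + 1) / n - c), abs(i / n - c))
    return d


def minimum_distance_weight(xs, grid):
    """Grid value of w whose mixed d.f. is closest to the empirical d.f."""
    sorted_xs = sorted(xs)
    return min(grid, key=lambda w: sup_distance(
        sorted_xs, lambda x: mixed_cdf(x, w)))


rng = random.Random(0)
true_w = 0.3
xs = [rng.gauss(0.0 if rng.random() < true_w else 3.0, 1.0)
      for _ in range(4000)]
grid = [k / 20 for k in range(21)]
w_hat = minimum_distance_weight(xs, grid)
print(w_hat)
```

In this one-parameter family the mapping from `w` to the mixed distribution function is one-to-one, which is the identifiability condition the text assumes of the kernel in general.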

I  should  like  to  express  my  appreciation  to  A.  Dvoretzky,  J.  Neyman,  and  H.  Raiffa 
for  helpful  discussions  and  suggestions. 